Abstract

In this paper we introduce a new technique for the detection of outliers 
in contingency tables. Outliers are defined with respect to classical loglinear 
Poisson models. For the computations, we define an algorithm based on the 
notion of minimal patterns to obtain robust identification of the outlying cell 
counts. Minimal patterns are suitable subsets of cell counts corresponding 
to non-singular design matrices, and therefore leading to valid maximum- 
likelihood estimates of the model parameters and of the cell counts. A cri- 
terion to easily produce minimal patterns in the two-way case under inde- 
pendence is introduced, based on the analysis of the positions of the chosen 
cells. A simulation study and a couple of real-data examples are presented 
to illustrate the performances of this new algorithm, and to compare it with 
other existing methods. 
1 Introduction

In every statistical analysis, observations can occur which "appear to be
inconsistent with the remainder of that set of data" (Barnett and Lewis, 1994). The
same authors also name outliers in contingency tables among little-explored
areas, which remains true to this day. For two-way tables, outliers have been treated
in a couple of research papers in connection with the multinomial model, e.g., by
employing residuals and by defining suitable tests based on them for their
detection (Simonoff, 1988; Fuchs and Kenett, 1980; Gupta et al., 2007). Approaches
for the detection of outliers in higher-dimensional tables with respect to the
Poisson model are also found in the literature (e.g., Upton and Guillen, 1995; Kuhnt).



In the context of contingency tables we deal with outlying cells rather than 
individual outlying observations contributing to the cell counts. Therefore, the 
detection of outliers in contingency tables is based on a sample of size one for 
each cell count, and this fact implies that any detection procedure must be de- 
fined with the greatest caution. Additionally, for more than one outlying cell, 
their position in the table can be crucial with respect to their identification as well 
as their effect on data analysis methods. This fact has been recognized in the 
discussion of outlier detection methods and breakdown concepts for contingency 
tables by Kuhnt (2000, 2010). Also, Rapallo (2012) introduces a notion of patterns
of outliers in connection with goodness-of-fit tests by applying techniques from 
algebraic statistics. 

In this paper we follow a new approach towards outlier identification in con- 
tingency tables. Going back to the general notion of outliers as observations de- 
viating from a structure supported by the majority of the data we define so-called 
minimal patterns. These sets cover more than half of the observations while at the 
same time containing just enough cells to ensure full rank of the subdesign matrix 
of a loglinear model. For each pattern the remaining cell counts are candidate 
outliers. Finding these patterns is not an easy task; for the independence model
in two-way tables we derive theoretical results on the nature of minimal patterns.
We suggest two possible algorithms to identify outliers by running through all
minimal patterns and using the notion of $\alpha$-outliers.

The paper is organized as follows. Section 2 briefly recalls $\alpha$-outliers with
respect to loglinear Poisson models and one-step outlier identification methods
based on ML- or $L_1$-estimators. In Section 3 we define (strictly) minimal patterns
and present two outlier detection methods with minimal patterns, called OMP and
OMPC, the latter identifying the cell counts that are most often flagged as outliers
with respect to the minimal patterns. Some results on the connection between
minimal patterns and cycles in subtables are derived in Section 4 for the independence
model in two-way tables. The performance of the different outlier identification
methods is compared by a simulation study in Section 5 and by applications to data
sets from the literature in Section 6. Finally, in Section 7 some conclusions and
comments are made.
2 Loglinear Poisson models, estimators and $\alpha$-outliers



We consider contingency tables with $N$ cell counts, assumed to be realizations of
random variables $Y_j$, $j = 1, \dots, N$, from a loglinear Poisson model. These models
may be presented as generalized linear models (Agresti, 2002) with structural
component

\[ E(Y_j) = \exp(x_j'\beta) =: m_j, \quad j = 1, \dots, N, \]

where $X = (x_1, \dots, x_N) \in \mathbb{R}^{p \times N}$ is the design matrix of full rank and
$\beta \in \mathbb{R}^p$ the unknown parameter vector. The maximum likelihood (ML-)estimator
of $\beta$ is given by

\[ \hat\beta^{ML} = \arg\max_{\beta \in \mathbb{R}^p} \sum_{j=1}^{N} \left( Y_j x_j'\beta - \exp(x_j'\beta) \right). \tag{1} \]

A more robust alternative is the $L_1$-estimator (Hubert, 1997)

\[ \hat\beta^{L_1} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{N} \left| \log Y_j - x_j'\beta \right|. \tag{2} \]
Generally, the notion of outliers as surprising observations far away from the
bulk of the data has been formalized by so-called $\alpha$-outlier regions (Davies and
Gather, 1993). Thereby observations which are located in a region of the sample
space with a very small probability of occurrence with respect to a given model
are defined as outliers. A formal definition of outliers in contingency tables is
given in Kuhnt (2004):

Definition 1. An observed cell count $y_j$ is called an $\alpha$-outlier with respect to a
loglinear Poisson model if it lies in the outlier region

\[ \mathrm{out}(\alpha, \mathrm{Poi}(m_j)) = \{ y \in \mathbb{N} : \mathrm{poi}(y, m_j) < K(\alpha) \}, \]

where $\mathrm{poi}(\cdot, m_j)$ denotes the probability density function of a Poisson random
variable, $\alpha \in (0, 1)$, and
$K(\alpha) = \sup\{ K > 0 : \sum_{y \in \mathbb{N}} \mathrm{poi}(y, m_j) \, 1_{[0,K]}(\mathrm{poi}(y, m_j)) \le \alpha \}$,
where $1_A(x)$ is the indicator function.
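
As an illustration of Definition 1, the following Python sketch computes the $\alpha$-inlier interval of a Poisson distribution. It relies on the unimodality of the Poisson probabilities, so that the outlier region consists of a lower and an upper tail. The function names are hypothetical and not part of the original work; the sketch assumes $m > 0$ and values of $\alpha$ well above floating-point precision.

```python
import math

def poisson_pmf(y, m):
    """Poisson probability poi(y, m), evaluated on the log scale."""
    return math.exp(y * math.log(m) - m - math.lgamma(y + 1))

def inlier_interval(m, alpha):
    """Return (lo, hi) such that y is an alpha-outlier iff y < lo or y > hi.

    Counts are added in order of decreasing probability, starting at the
    mode, until the probability mass left outside is at most alpha.
    """
    lo = hi = math.floor(m)              # a mode of Poisson(m)
    inside = poisson_pmf(lo, m)
    while 1.0 - inside > alpha:
        left = poisson_pmf(lo - 1, m) if lo > 0 else -1.0
        right = poisson_pmf(hi + 1, m)
        if left >= right:                # extend downwards (exact ties extend both ways)
            lo -= 1
            inside += left
        if right >= left:                # extend upwards
            hi += 1
            inside += right
    return lo, hi

def is_alpha_outlier(y, m, alpha):
    """True if the count y lies in out(alpha, Poi(m))."""
    lo, hi = inlier_interval(m, alpha)
    return y < lo or y > hi

# Poisson(1) with alpha = 0.05: the counts 0..3 carry about 98.1% of the mass.
print(inlier_interval(1.0, 0.05))        # (0, 3)
print(is_alpha_outlier(4, 1.0, 0.05))    # True
```

Returning the interval rather than the set keeps the check for a single count cheap; the region itself is the complement of `[lo, hi]` in the non-negative integers.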

Using this notion, one-step outlier identifiers are easily derived, which are
defined as follows.

Definition 2. Let $\alpha \in (0, 1)$ be given. A one-step outlier identifier based on the
$L_1$- (or ML-) estimator is defined by the following procedure:

(i) Estimate $\hat m_j$, $j = 1, \dots, N$, for the loglinear Poisson model based on the
complete contingency table by the $L_1$- (or ML-) estimator.

(ii) Identify cell counts $y_j$ in $\alpha$-outlier regions with respect to $\mathrm{Poi}(\hat m_j)$ as
outliers.

The choice of $\alpha$ for the one-step outlier identifiers in relation to the size $N$ of
the table is discussed in Kuhnt (2004). These identifiers are compared in Section
5 to the new methods developed next.
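
The two steps of Definition 2 can be sketched in Python for the ML-based identifier. The sketch is restricted, as an assumption made here for brevity, to the independence model in a two-way table, where the ML estimates have the closed form (row total × column total) / grand total. The helper names are hypothetical, and all margins are assumed to be positive.

```python
import math

def inlier_interval(m, alpha):
    """Bounds (lo, hi) of the alpha-inlier region of Poisson(m), m > 0."""
    pmf = lambda y: math.exp(y * math.log(m) - m - math.lgamma(y + 1))
    lo = hi = math.floor(m)
    inside = pmf(lo)
    while 1.0 - inside > alpha:
        left = pmf(lo - 1) if lo > 0 else -1.0
        right = pmf(hi + 1)
        if left >= right:
            lo, inside = lo - 1, inside + left
        if right >= left:
            hi, inside = hi + 1, inside + right
    return lo, hi

def one_step_ml_outliers(table, alpha):
    """Step (i): ML fit of the independence model on the complete table.
    Step (ii): flag the cells lying in the alpha-outlier region of Poi(m_hat)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    flagged = []
    for i, row in enumerate(table):
        for j, y in enumerate(row):
            m_hat = rows[i] * cols[j] / n    # closed-form ML estimate
            lo, hi = inlier_interval(m_hat, alpha)
            if y < lo or y > hi:
                flagged.append((i, j))
    return flagged

clean = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]   # exactly independent
print(one_step_ml_outliers(clean, 0.01))             # []
```

Because the fit in step (i) uses every cell, a gross outlier also distorts the fitted means of other cells, which is the weakness the pattern-based methods of Section 3 address.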



3 Detecting outliers based on minimal patterns 

Consider the notion of outliers as observations which are deviating from a model 
structure supported by the majority of the data. Here this model is assumed to 
be a loglinear model characterized by its design matrix X. We look at patterns in 
the table, given as subsets of the cells, which cover at least half of the table but 
not more observations than necessary to ensure a full rank design matrix. These 
patterns are seen as potential core sets of the majority of the data from which 
individual observations deviate. 

Definition 3. Let $X$ be the design matrix of a log-linear model with parameter
space $\mathbb{R}^p$. A subset of cells is called a minimal pattern if

(i) the subset has at least $\lfloor N/2 \rfloor + 1$ elements;

(ii) the corresponding submatrix of $X$ is of full rank;

(iii) the subset has the minimal number of elements necessary to fulfill condition
(i) and condition (ii).

Restricting the considered subset of the cells to those necessary to uniquely 
define model parameters leads to the definition of strictly minimal patterns. 

Definition 4. Let $X$ be the design matrix of a log-linear model with parameter
space $\mathbb{R}^p$. A subset of $p$ cells is called a strictly minimal pattern if the
corresponding submatrix of $X$ is of full rank.

If $p = \lfloor N/2 \rfloor + 1$ holds, then strictly minimal and minimal patterns coincide.
In case of $p < \lfloor N/2 \rfloor + 1$, adding $\lfloor N/2 \rfloor + 1 - p$ arbitrarily chosen cells to a strictly
minimal pattern returns a minimal pattern. Note that not all subsets with $p$ cells
yield non-singular matrices.
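
The rank condition of Definition 4 can be checked directly. The Python sketch below does so for the independence model in a two-way table, used here only as an illustrative case. It assumes a design with one constant entry, indicators for the first $I-1$ rows and indicators for the first $J-1$ columns. The function names are hypothetical, and exact rational arithmetic avoids any numerical tolerance in the rank computation.

```python
from fractions import Fraction

def independence_column(i, j, I, J):
    """Design-matrix column of cell (i, j) (0-based) for the independence model."""
    return ([1]
            + [1 if i == k else 0 for k in range(I - 1)]
            + [1 if j == k else 0 for k in range(J - 1)])

def rank(vectors):
    """Rank of a list of vectors by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((k for k in range(r, len(rows)) if rows[k][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(len(rows)):
            if k != r and rows[k][c] != 0:
                f = rows[k][c] / rows[r][c]
                rows[k] = [a - f * b for a, b in zip(rows[k], rows[r])]
        r += 1
    return r

def is_strictly_minimal(cells, I, J):
    """Definition 4: p cells whose columns form a full-rank submatrix."""
    p = I + J - 1
    return len(cells) == p and rank([independence_column(i, j, I, J)
                                     for i, j in cells]) == p

# A complete 2 x 2 subtable plus one further cell gives a singular submatrix.
print(is_strictly_minimal([(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)], 3, 3))  # False
# First row together with first column gives a non-singular submatrix.
print(is_strictly_minimal([(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)], 3, 3))  # True
```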

Before developing algorithms for the detection of outliers based on minimal
patterns we fix some notation. Let $\mathcal{W}$ be the set of all $W$ minimal patterns and
$X$ the full design matrix in the loglinear Poisson model. Each column of $X$
corresponds to a cell in the contingency table. Taking only the columns of $X$ which
correspond to the cells of each minimal pattern yields $X_w$, $w = 1, \dots, W$.

The idea behind the new outlier detection methods is to run through all minimal
patterns and consider each of them as an outlier-free subset of the table. The
maximum likelihood estimate from these cells provides estimated mean values
for all cells. We check for all cell counts outside the pattern whether they lie in the
$\alpha$-outlier region with respect to the Poisson distribution given by the estimate. Those
cells for which this is true make up the set of outliers with respect to the minimal
pattern. Hence, we get a set of outliers for each minimal pattern. A first algorithm
for the detection of outliers with minimal patterns, called OMP, is defined in
Algorithm 1; it returns the set with the minimal number of elements as the identified
outliers.

Algorithm 1 Outlier detection with minimal patterns (OMP)
for w = 1 to W do
    $\hat\beta_w^{ML} \leftarrow \arg\max_{\beta} \sum_{j \in w} \left( y_j x_j'\beta - \exp(x_j'\beta) \right)$
    for j = 1 to N do
        Determine $\mathrm{out}(\alpha, \mathrm{Poi}(\hat m_j))$ for $\hat m_j$ based on $\exp(x_j'\hat\beta_w^{ML})$
    end for
    NUMB.OUT$_w$ $\leftarrow$ Number of outliers for minimal pattern w
end for
for w = 1 to W do
    if NUMB.OUT$_w$ = min(NUMB.OUT) then
        Outlier pattern $\leftarrow$ Cells with outliers identified with minimal pattern w
    end if
end for



Notice that in Algorithm 1 the minimum number of outliers may be attained
for more than one minimal pattern. Then more than one solution exists and
different possible outlier patterns are identified, which can be discussed based on
knowledge of the subject.

A slightly different alternative to OMP is implemented in Algorithm 2, called
outlier detection with minimal patterns and the count method (OMPC). Here we
count how often each cell is identified as an outlier with respect to a minimal pattern.
If the cell is identified in more than half of the cases, it is declared an outlier. We
denote the number of minimal patterns not including the cell by $r$. The choice of
the value $r/2$ as a cut-off in order to discriminate between outliers and inliers is
briefly discussed in Section 5.

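Algorithm 2 Outlier detection with minimal patterns and the count method (OMPC)
for w = 1 to W do
    $\hat\beta_w^{ML} \leftarrow \arg\max_{\beta} \sum_{j \in w} \left( y_j x_j'\beta - \exp(x_j'\beta) \right)$
    for j = 1 to N, $j \notin w$ do
        Determine $\mathrm{out}(\alpha, \mathrm{Poi}(\hat m_j))$ for $\hat m_j$ based on $\exp(x_j'\hat\beta_w^{ML})$
    end for
end for
r $\leftarrow$ absolute frequency of each cell not contained in a minimal pattern
if $\sum_{w :\, j \notin w} 1_{\mathrm{out}(\alpha, \mathrm{Poi}(\hat m_j))}(y_j) > r/2$ then
    $y_j$ is an outlier
end if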



When W becomes large and the enumeration of all minimal patterns is not 
feasible, it is possible to introduce a standard Monte Carlo approximation in the 
algorithms. 
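
The two algorithms can be sketched in Python under simplifying assumptions that are not part of the original text. The sketch covers the independence model in a two-way table, enumerates strictly minimal patterns only (these coincide with the minimal patterns when $I + J - 1 = \lfloor N/2 \rfloor + 1$, as for $3 \times 3$ tables), and requires positive counts. With $p$ cells and $p$ parameters the ML fit reproduces the pattern counts exactly, so the fitted means $a_i b_j$ are obtained by propagation through the pattern. All function names are hypothetical.

```python
import math
from itertools import combinations

def inlier_interval(m, alpha):
    """Bounds (lo, hi) of the alpha-inlier region of Poisson(m), m > 0."""
    pmf = lambda y: math.exp(y * math.log(m) - m - math.lgamma(y + 1))
    lo = hi = math.floor(m)
    inside = pmf(lo)
    while 1.0 - inside > alpha:
        left = pmf(lo - 1) if lo > 0 else -1.0
        right = pmf(hi + 1)
        if left >= right:
            lo, inside = lo - 1, inside + left
        if right >= left:
            hi, inside = hi + 1, inside + right
    return lo, hi

def fit_from_pattern(y, cells, I, J):
    """Fitted means m_ij = a_i * b_j that reproduce the counts on the pattern.
    Returns None if the cells do not link all rows and columns (singular case)."""
    a, b = {cells[0][0]: 1.0}, {}
    grew = True
    while grew:
        grew = False
        for i, j in cells:
            if i in a and j not in b:
                b[j], grew = y[i][j] / a[i], True
            elif j in b and i not in a:
                a[i], grew = y[i][j] / b[j], True
    if len(a) < I or len(b) < J:
        return None
    return [[a[i] * b[j] for j in range(J)] for i in range(I)]

def omp_and_ompc(y, alpha=0.01):
    I, J = len(y), len(y[0])
    cells = [(i, j) for i in range(I) for j in range(J)]
    hits = dict.fromkeys(cells, 0)     # times a cell is flagged as outlier
    r = dict.fromkeys(cells, 0)        # patterns not containing the cell
    per_pattern = []                   # outlier set of each pattern
    for pattern in combinations(cells, I + J - 1):
        m = fit_from_pattern(y, pattern, I, J)
        if m is None:
            continue                   # singular submatrix, not a pattern
        out = []
        for i, j in cells:
            if (i, j) in pattern:
                continue
            r[(i, j)] += 1
            lo, hi = inlier_interval(m[i][j], alpha)
            if y[i][j] < lo or y[i][j] > hi:
                out.append((i, j))
                hits[(i, j)] += 1
        per_pattern.append(out)
    smallest = min(len(o) for o in per_pattern)
    omp = sorted({tuple(o) for o in per_pattern if len(o) == smallest})
    ompc = [c for c in cells if hits[c] > r[c] / 2]
    return omp, ompc

# Exactly independent table with the count in cell (0, 0) inflated from 10 to 100.
table = [[100, 20, 30], [20, 40, 60], [30, 60, 90]]
print(omp_and_ompc(table))             # ([((0, 0),)], [(0, 0)])
```

OMP returns every outlier set of minimal size, so non-unique solutions stay visible. For larger tables the exhaustive loop over `combinations` would have to be replaced by a random sample of patterns.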

4 Minimal patterns and cycles in the independence 
model 

Running through all possible subsets of dimension p to determine the strictly min- 
imal patterns quickly becomes unfeasible for larger dimensional tables. It is there- 
fore important to analyze the structure of these patterns in more detail. 

We focus on the loglinear independence model for two-dimensional $I \times J$
contingency tables, assuming without loss of generality $I \le J$. The design matrix $X$
can be expressed as:

\[ X = [a_0, r_1, \dots, r_{I-1}, c_1, \dots, c_{J-1}]', \tag{3} \]

where $a_0$ is the vector of ones, $r_1$ is the indicator vector of the first row, $c_1$ is the
indicator vector of the first column, and so on. For instance, the design matrix for
$3 \times 3$ tables is:

\[
X = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0
\end{pmatrix}. \tag{4}
\]

Another classical representation of the same model is given by the design matrix
(5)

We will use the latter parametrization in the simulation study as it is the usual
parametrization implemented in the software for loglinear models, while we use
the former parametrization in the proofs, as many formulae become easy to
handle.

In this model, the relevant parameter space for the unknown parameter vector
$\beta$ is $\mathbb{R}^{I+J-1}$. Table 1 shows that the number of possible patterns with
$p = I + J - 1$ cells as well as the number of (strictly) minimal patterns increases quickly
for higher dimensional tables.



Table                                     3x3    2x5    3x4    3x5    4x4    3x6     4x5
$p = I + J - 1$                             5      6      6      7      7      8       8
$\lfloor N/2 \rfloor + 1$                   5      6      7      8      9     10      11
# subsets of $\lfloor N/2 \rfloor + 1$ cells  126    210    792   6435  11440  43758  167960
W = # min. patterns                        81     80    612   3780   9552  26325  139660
# str. min. patterns                       81     80    432   2025   4096  41066  105408

Table 1: Number of minimal patterns for different independence models



Example 1. In the case of $3 \times 3$ tables, the two configurations below have different
behavior:

(the $\star$'s denote the chosen cells). The configuration on the left hand side produces
a singular submatrix, while the configuration on the right hand side produces a
non-singular matrix, and hence it is a strictly minimal pattern. At first glance,
we note that in the singular case there is a complete $2 \times 2$ subtable among the
chosen cells, while in the other case there is not. The relevance of $2 \times 2$ subtables in
the study of the independence model is well known, see e.g. Agresti (2002), and a
different perspective within the field of Algebraic Statistics is investigated in e.g.
Rapallo (2003). However, the simple notion of a $2 \times 2$ submatrix is not sufficient
to effectively describe the problem, as shown in the following example:

In this case, the chosen configuration does not contain any $2 \times 2$ submatrices, and
nevertheless the corresponding submatrix is singular.

To explore the structure of patterns in the table we need to introduce the notion
of $k$-cycle.

Definition 5. Let $k \ge 2$. A $k$-cycle is a set of $2k$ cells contained in a $k \times k$ subtable,
with exactly 2 cells in each row and in each column of the submatrix.

Example 2. In view of Definition 5, a 2-cycle is simply a $2 \times 2$ submatrix, while
a 3-cycle is a set of 6 cells of the form

In case of the independence model, the following theorem shows that the cy- 
cles are the key ingredient to check whether a subset of p cells is a strictly minimal 
pattern. 

Theorem 1. A set of $p = I + J - 1$ cells forms a strictly minimal pattern for the
independence model if and only if it does not contain any $k$-cycles, $k \ge 2$.

Proof. First, note that a cycle can be decomposed into two subsets of $k$ cells each
with one cell in each row and in each column. It is enough to sum the columns
of the design matrix $X$ with coefficient $+1$ for the cells in the first subset and
with coefficient $-1$ for the second subset and we obtain a null vector. Thus, the
submatrix is singular and the set does not form a strictly minimal pattern.

Conversely, if the submatrix is singular, then there is a null linear combination
among the columns of the submatrix, with coefficients not all zero. Denote by
$C_{(i,j)}$ the column of the design matrix corresponding to the cell $(i,j)$. Therefore,
we have

\[ \gamma_1 C_{(i_1,j_1)} + \dots + \gamma_p C_{(i_p,j_p)} = 0 \tag{6} \]

and the coefficients $\gamma_1, \dots, \gamma_p$ are not all zero. Without loss of generality, suppose
that $\gamma_1 > 0$. As the indicator vector of row $i_1$ belongs to the row span of $X$ and
the same holds for the indicator vector of column $j_1$, we must have: a cell in the
same row $(i_2, j_2) = (i_1, j_2)$ with negative coefficient in Eq. (6); a cell in the same
column $(i_3, j_3) = (i_3, j_1)$ with negative coefficient in Eq. (6). Therefore, there
must be a cell in row $i_3$ and a cell in column $j_2$ with positive coefficients. Now,
two cases can happen:

• if the cell $(i_3, j_2)$ is a chosen cell and its coefficient in Eq. (6) is positive,
we have a 2-cycle;

• otherwise, we iterate the same reasoning as above, with another pair of cells.

This shows that there exists a certain number $k$ of rows ($k \ge 2$), and the same
number of columns, with two cells each with a non-zero coefficient. Such cells
form by definition a $k$-cycle. □

As a corollary, the following algorithm produces strictly minimal patterns:

1. Let $\mathcal{C}$ be the set of all cells of the table, and $\mathcal{S} = \emptyset$ the set of the chosen
cells.

2. For $q \in \{1, \dots, I + J - 1\}$:

• Choose a cell uniformly from $\mathcal{C}$, add it to $\mathcal{S}$, and delete it from $\mathcal{C}$;

• Find all 3-tuples, 5-tuples and so on of cells in $\mathcal{S}$ containing the chosen
cell and delete from $\mathcal{C}$ all cells (if any) producing 2-cycles, 3-cycles
and so on.

Notice that the first three cells are chosen without any restrictions. Moreover,
as the algorithm is symmetric on row and column permutation, one has that the
strictly minimal pattern is selected uniformly in the set of all strictly minimal
patterns.
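
A possible Python sketch of the cycle criterion and of the sequential selection is given below. It treats rows and columns as nodes of a graph and every chosen cell as an edge, so that a $k$-cycle corresponds to a cycle of this graph; this graph formulation and the function names are assumptions of the sketch, not notation from the text. Instead of listing 3-tuples and 5-tuples explicitly, candidate cells that would close a cycle are detected with a union-find structure.

```python
import random
from itertools import combinations

def has_cycle(cells):
    """True if the cells contain a k-cycle for some k >= 2."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in cells:
        root_row, root_col = find(("row", i)), find(("col", j))
        if root_row == root_col:            # row i and column j already linked
            return True
        parent[root_row] = root_col
    return False

def sample_strictly_minimal(I, J, rng=random):
    """Sequential selection: pick a remaining cell at random, then discard
    every cell that would close a cycle with the cells chosen so far."""
    candidates = [(i, j) for i in range(I) for j in range(J)]
    chosen = []
    while len(chosen) < I + J - 1:
        cell = rng.choice(candidates)
        chosen.append(cell)
        candidates = [c for c in candidates
                      if c != cell and not has_cycle(chosen + [c])]
    return chosen

# Cycle-free subsets of p = 5 cells in a 3 x 3 table (compare Table 1).
cells = [(i, j) for i in range(3) for j in range(3)]
print(sum(not has_cycle(s) for s in combinations(cells, 5)))   # 81
print(sample_strictly_minimal(3, 4))
```

The count of 81 agrees with the entry for $3 \times 3$ tables in Table 1. The sketch makes no statement about the distribution of the sampled pattern.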
For $3 \times 3$ tables, our statement is equivalent to another criterion, to be found
in Kuhnt (2000).



Corollary 1. For the independence model for $3 \times 3$ tables, the absence of 2-cycles
is equivalent to:

(i) no empty rows;

(ii) no empty columns;

(iii) for each selected cell, there is at least another cell in the same row or in the
same column.

Proof. Suppose that there is an empty row. In the remaining two rows we have
to put 5 cells, and a 2-cycle must appear. The same reasoning holds in the case
of an empty column. Finally, if there is a selected cell, say $(i,j)$, with no other
cells in the same row or in the same column, we exclude for the remaining 4 cells
of the minimal pattern the 5 cells of the $i$-th row and of the $j$-th column. Thus the
remaining 4 cells are forced to constitute a 2-cycle.

On the other hand, suppose that there is a 2-cycle, and suppose without loss
of generality that the cycle is formed by the cells (1,1), (1,2), (2,1), (2,2). The
last selected cell can be chosen in 5 different ways. In two cases, (1,3) or (2,3),
we have an empty row; in two cases, (3,1) or (3,2), we have an empty column;
in the last case, (3,3), this cell has no other cells in the same row or in the same
column. □

In case of the independence model for two-way tables, we can define an algo- 
rithm to efficiently sample minimal patterns. Our strategy is as follows: 

(a) First, choose exactly $p = I + J - 1$ cells to define a non-singular submatrix
of the design matrix $X$.

(b) Add randomly chosen cells in order to reach the desired number, which is
$\lfloor N/2 \rfloor + 1$.



5 Simulation study 



In the previous sections we presented different methods to identify $\alpha$-outliers. To
compare different outlier identifiers, Kuhnt (2010) discusses breakdown points of
the methods. For the OMPC and the OMP methods, it is not clear if breakdown
points or similar criteria can be derived theoretically at all. Hence we present three
loglinear Poisson models with varying outlier situations, conduct simulations and
check whether the methods (one-step ML (OML), one-step $L_1$ (OL1) and OMPC)
detect outliers and inliers correctly. We exclude OMP from the comparison as it
might lead to results which are not unique and therefore not directly comparable.

For the following simulations we adapt the notion of "types" and "antitypes"
from Configural Frequency Analysis (von Eye, 2002). A type is defined as a cell
in a contingency table with a higher value than the upper bound of the
corresponding $\alpha$-inlier region; an antitype is a cell with a smaller value than the lower
bound of the corresponding $\alpha$-inlier region.

We judge the different methods by the proportion of correctly identified out- 
liers and inliers. 

We consider three different loglinear Poisson models (for $3 \times 3$, $4 \times 4$ and
$10 \times 10$ tables) and insert various outlying values in the simulated contingency tables. For
example, we vary the $\alpha$-value which determines the outlyingness of the inserted
value. All outlier identification methods are always calculated with 0.01-outlier
regions of the model given by the parameter estimate.

The six simulated scenarios are described below. The simulations were
performed with R (R Development Core Team, 2010) and the results are given in
Table 2.
1. We generate 100 $3 \times 3$ contingency tables with $X = X_{3 \times 3}$ and
$\beta_1 = (4, 0.2, -0.2, 0.4, 0.3)'$ with only one $\alpha$-outlier ($\alpha = 10^{-4}$) in cell
(1,1). Since the position of one outlier in the table is unimportant we place
the outlier in the first row and column of each table. The outlier can be seen
as a moderate outlier. For the cell (1,1), the outlier region with respect to a
Poisson distribution is given by:

\[ [0, \mathrm{out}_{\mathrm{left}}) \cup (\mathrm{out}_{\mathrm{right}}, \infty) = [0, 63) \cup (140, \infty) \]

such that the value 62 is inserted as antitype and 141 as type.

2. Since $3 \times 3$ contingency tables have been analyzed extensively in Kuhnt (2000),
we move to larger tables. We then consider 100 $4 \times 4$ tables based
on $\beta_2 = (3.8, 0.2, -0.2, 0.1, 0.25, 0.3, -0.1)'$. Here we consider tables with
only one moderate $\alpha$-outlier ($\alpha = 10^{-4}$) in cell (1,1).

3. Again we generate 100 $4 \times 4$ tables based on $\beta_2$. To see how the methods
work with several outliers, we add another $\alpha$-outlier ($\alpha = 10^{-4}$), resulting
in three different situations: two types, two antitypes, and one type and one
antitype. In this scenario, we insert the outliers in cells (1,1) and (1,2).
Notice that the presence of two outliers in the same row can notably distort the
estimates for that row.

4. We reconsider the situation from the third scenario with $\beta_2$. This time, we
replace two values on the main diagonal of the contingency table with
$\alpha$-outliers ($\alpha = 10^{-4}$) in cells (1,1) and (2,2). In this case, the two outliers
affect different parameter estimates.

5. The last simulation with $4 \times 4$ tables based on $\beta_2$ is similar to the third
scenario, but here the outlyingness of the replaced values in cells (1,1) and
(1,2) differs. Now, $\alpha = 10^{-8}$ is considered.

6. We finish the simulation studies with the generation of 100 large $10 \times 10$
contingency tables. The corresponding parameter vector is

$\beta_3 = (3.3, 0.2, -0.2, 0.1, 0.25, 0.3, -0.1, 0.4, 0.2, 0.1,$
$0.2, -0.4, 0.2, -0.2, 0.1, 0.0, 0.1, -0.3, 0.1)'$.

The $\alpha$-outliers have been placed in cell (1,1) and cell (2,3), with
$\alpha = 10^{-4}$. The number of minimal patterns we consider here is restricted to
500. The patterns have been chosen randomly, therefore the cells do not all
occur with the same frequency across the patterns.
But we can handle this slight discrepancy by adjusting the number of times
each cell has to be detected as an outlier in the OMPC method.

Analyzing the results in Table 2, some comments are now in order.

• Scenarios 1 and 2 show that the classical one-step methods OML and OL1 do
not behave satisfactorily for small tables. In particular, the OML
method fails to detect the outlier in at least 85% of cases. Among the one-step
algorithms, OL1 presents better results. On the other hand, OMPC has
a proportion of correctly classified outliers notably higher than the one-step
methods.

• Scenarios 3 and 4 show that the position of the outlying cells within the
table is a major issue. In fact, when the two outliers are placed in the same row,
the proportion of correctly classified outliers reduces considerably. This
phenomenon is particularly evident in case of two types or two antitypes,
since in such cases the outliers give rise to relevant changes in the parameter
estimates. With two outliers in the same row we find again that the OML
method is unreliable.

• Comparing scenarios 3 and 5, we observe that all procedures perform better
in finding outliers when the outlyingness of the two cells is higher.

• Scenario 6 shows that the proposed methods are still valid for larger tables,
even though the differences between the three methods become less relevant.

• Finally, in all simulations the OMPC algorithm is slightly less efficient in
detecting inliers. This means that in a few cases it finds more outliers
than expected. This issue will be discussed again after the real data examples,
in connection with the behavior of the OMP method.

Scenario 1              n11 = 62               n11 = 141
Estimator          outliers  inliers      outliers  inliers
OML                 0.000    0.999         0.010    0.999
OL1                 0.320    0.963         0.480    0.974
OMPC                0.680    0.754         0.820    0.773

Scenario 2              n11 = 39               n11 = 105
Estimator          outliers  inliers      outliers  inliers
OML                 0.150    0.999         0.050    0.999
OL1                 0.620    0.979         0.620    0.989
OMPC                0.890    0.899         0.900    0.909

Scenario 3            2 antitypes         1 type, 1 antitype          2 types
                  n11 = 39, n12 = 42     n11 = 39, n12 = 110    n11 = 105, n12 = 110
Estimator          outliers  inliers      outliers  inliers      outliers  inliers
OML                 0.000    0.992         0.630    0.996         0.000    0.997
OL1                 0.035    0.960         0.725    0.986         0.200    0.983
OMPC                0.435    0.8679        1.000    0.8779        0.470    0.9014

Scenario 4            2 antitypes         1 type, 1 antitype          2 types
                  n11 = 39, n22 = 23     n11 = 39, n22 = 79     n11 = 105, n22 = 79
Estimator          outliers  inliers      outliers  inliers      outliers  inliers
OML                 0.450    0.989         0.020    0.996         0.210    0.990
OL1                 0.740    0.976         0.495    0.980         0.635    0.984
OMPC                0.975    0.804         0.840    0.834         0.965    0.857

Scenario 5            2 antitypes         1 type, 1 antitype          2 types
                  n11 = 27, n12 = 29     n11 = 27, n12 = 128    n11 = 124, n12 = 128
Estimator          outliers  inliers      outliers  inliers      outliers  inliers
OML                 0.105    0.953         0.985    0.979         0.000    0.986
OL1                 0.140    0.896         0.980    0.987         0.450    0.969
OMPC                0.880    0.771         1.000    0.643         0.855    0.829

Scenario 6            2 antitypes         1 type, 1 antitype          2 types
                  n11 = 18, n23 = 9      n11 = 18, n23 = 49     n11 = 67, n23 = 49
Estimator          outliers  inliers      outliers  inliers      outliers  inliers
OML                 0.961    0.998         0.865    0.999         0.847    0.999
OL1                 0.963    0.991         0.936    0.992         0.935    0.992
OMPC                0.990    0.940         0.990    0.953         1.000    0.956

Table 2: Proportions of correctly classified outliers and inliers in the 6 simulation
scenarios.

Finally, a few words about the choice of $r/2$ as the cutoff value in the
OMPC algorithm. If we consider a different cutoff of the form $hr$ ($0 < h < 1$), we
obtain in the second scenario above the graph in Figure 1, which shows the proportion
of correctly classified inliers and outliers when $h$ ranges between 0 and 1; the
graphs for the other situations present a quite similar shape. This graph shows that
$r/2$ is a reasonable choice. Nevertheless, for special applications this value may
be modified. For instance, to prevent the detection of too many
outliers, one can set the cutoff at a higher value.



6 Case studies 



6.1 Artifacts discovered in Nevada 



To see how the previously mentioned algorithms work compared to standard
procedures, we look at the data in Table 3 (Mosteller and Parunak, 2006). This table
shows how far away from permanent water certain types of archaeological
artifacts have been found.

                               Distance from permanent water
Artifact type        Immediate    Within        0.25 - 0.5    0.5 - 1
                     vicinity     0.25 miles    miles         mile
Drills                   2           10             4            2
Pots                     3            8             4            6
Grinding stones         13            5             3            9
Point fragments         20           36            19           20

Table 3: Archaeological finds discovered in Nevada, from Mosteller and Parunak
(2006)

The OML method as well as the OL1 method yield no outliers for $\alpha = 0.001$.
This also holds for the OMP method. In contrast, the OMPC method finds two
outliers for $\alpha = 0.001$, i.e. cells (3,1) and (3,2). Looking at this method with a
smaller $\alpha = 0.0005$ we find that only cell (3,1) remains an outlying cell in the OMPC
method. This dataset has also been studied in Simonoff (1988), where cell (3,1)
has been declared a "sure outlier" and cell (3,2) can be seen as a borderline
situation.

Figure 1: Correctly classified inliers and outliers for varying h in Scenario 2.



6.2 Social networks 



McKinley (1973) presents a study concerning lay consultation and help-seeking
behavior based on eighty-seven working-class families in Aberdeen. We consider
a three-dimensional table on friendship networks of pregnant women from this
data set. The first variable concerns the frequency of interactions with friends,
measured as daily ($X_1 = 1$), once a week or more ($X_1 = 2$) and less than once a
week ($X_1 = 3$). The geographic proximity to the friends is covered by variable $X_2$
with the categories walk ($X_2 = 1$) and bus ($X_2 = 2$). The last variable $X_3$ states
whether the woman is pregnant with the first ($X_3 = 2$) or a further child ($X_3 = 1$).
The data are summarized in Table 4.





X2: Distance                    1: Walk                      2: Bus
X3: Parity              1: Not first   2: First     1: Not first   2: First
X1: Frequency of visits
1: Daily                     30            6              2           13
2: Weekly                    19           12             16            8
3: Less often                 5           12             10            4

Table 4: Data set on social networks from McKinley (1973).



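The model we consider assumes the conditional independence of $X_1$ and
$X_3$ given $X_2$ and has design matrix

\[
X = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & -1 & -1 & -1 & -1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1
\end{pmatrix}
\]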

Running the four outlier identification methods, we obtain that with the OML
method all observations are classified as inliers. The OL1 method yields two
outliers in the two extreme values $n_{111} = 30$ and $n_{121} = 2$.

Now we compare the previous results with those yielded by minimal patterns.
There are $\binom{12}{8} = 495$ sets with eight elements each and 144 of them fulfill
Definition 3. The minimal patterns yield 40 times three outliers, 88 times two outliers
and 16 times one outlier. Therefore we look at those cases where the OMP method
found only one outlier, more precisely cell $n_{121}$ and cell $n_{122}$ (eight times each).
So, this method yields two different solutions.
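
The count of 144 patterns can be checked by brute force. The Python sketch below builds an effect-coded design column for each of the 12 cells and counts the 8-cell subsets with a full-rank submatrix. It assumes the model contains the main effects and the interactions $X_1 X_2$ and $X_2 X_3$; the coding and the function names are assumptions of the sketch, and a different full-rank coding of the same model would give the same count.

```python
from fractions import Fraction
from itertools import combinations, product

def design_column(x1, x2, x3):
    """Effect-coded column of cell (x1, x2, x3), x1 in 1..3, x2 and x3 in 1..2."""
    a = {1: (1, 0), 2: (0, 1), 3: (-1, -1)}[x1]
    b = 1 if x2 == 1 else -1
    c = 1 if x3 == 1 else -1
    # constant, X1 (two columns), X2, X3, X1:X2 (two columns), X2:X3
    return [1, a[0], a[1], b, c, a[0] * b, a[1] * b, b * c]

def rank(vectors):
    """Rank of a list of vectors by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    r = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((k for k in range(r, len(rows)) if rows[k][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(len(rows)):
            if k != r and rows[k][col] != 0:
                f = rows[k][col] / rows[r][col]
                rows[k] = [u - f * v for u, v in zip(rows[k], rows[r])]
        r += 1
    return r

cells = list(product((1, 2, 3), (1, 2), (1, 2)))       # n111, n112, ..., n322
X = {cell: design_column(*cell) for cell in cells}
p = 8
subsets = list(combinations(cells, p))
patterns = [s for s in subsets if rank([X[c] for c in s]) == p]
print(len(subsets), len(patterns))                      # 495 144

# Number of patterns that do not contain a given cell (the value r in OMPC).
print({cell: sum(cell not in s for s in patterns) for cell in cells})
```

Every cell is missing from 48 of the 144 patterns, which is the largest number of times a cell can be flagged by the OMPC method in this example.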

The OMPC method produces similar results. A cell can be detected as an
outlier 48 times at most. The cells $n_{111}, n_{121}, n_{122}$ have been detected 48 times,
cells $n_{311}$ and $n_{312}$ have not been detected as outliers, and the rest of the cells have
been found 24 times, hence in 50% of the possible cases. It is conspicuous that a cell
is flagged as an outlier either always, in 50% of the cases, or not at all. This fact holds also
for other cell counts and the given model. Furthermore, we are not interested in
having 10 outliers and 2 inliers, which is why we declare only the cells $n_{111}, n_{121}, n_{122}$
as outliers. The results from the four methods are summarized
in Table 5.

Method    Cells identified as outliers
OML       none
OL1       n111, n121
OMP       n121 or n122 (two solutions)
OMPC      n111, n121, n122

Table 5: Identification results for the Social Network example.



Upton (1980) and Upton and Guillen (1995) also analyze the given contingency
table with regard to outliers. They state that $n_{122}$ should be regarded as
an outlier because many pregnant women are still working and get there by bus.
There they see their co-workers who are also their friends. This cell has been
detected as one of the two solutions of the OMP method, which supports the
hypothesis that it works well for a reasonable model and rather small contingency
tables. The OMPC method also detected $n_{122}$ as an outlier, but not as the only
one.



6.3 Study of social mobility in Britain 

As a final example, we briefly present the results on an example dataset from Glass
and Berent (1954). The status categories of fathers and their sons are put together
in a $(7 \times 7)$-contingency table. Goodman (1971) merges certain classes which
yields the $3 \times 3$ contingency table in Table 6.

                          Son
                  high   middle    low
Father  high       588     395     159
        middle     349     714     447
        low        111     320     411

Table 6: Status categories of fathers and sons from Glass and Berent (1954).



Here, OMP identifies the observations $n_{11}, n_{22}, n_{33}$ as outliers. The OMPC
method identifies every cell as an outlier, which seems surprising on the one hand,
but on the other hand it is coherent since the underlying independence model is
obviously the wrong one. The choice of the model seems to be more important to
the OMPC method than to the others. The OMP method yields the only intuitively
plausible outlier pattern with the main diagonal. A potential alternative is given
by the OL1 method ($n_{11}, n_{13}, n_{31}$ and $n_{33}$ are outliers), while the OML (7 outliers)
and the OMPC offer no satisfying results in this case.

7 Conclusions 

From the simulations and the real data examples, we can now summarize the main 
features of the outlier detection algorithms considered here. 

About the one-step procedures, the $L_1$-estimates are not as prone to swamping
as the ML-estimates, which is why the OML method is in general the more
conservative one. The OMPC method in most cases outperforms the one-step pro- 
cedures, and the examples suggest that also the OMP method works better than the 
one-step procedures. In all scenarios of the simulation study, the OMPC method 
produces a proportion of correctly classified outliers substantially higher than the 
other procedures. 

On the other hand, the detection of outliers becomes difficult when there are 
several outliers in one row or in one column (see the third scenario), and more 
generally the detection is not easy when the proportion of outliers with respect to 
the number of cells is high, as shown in the last example. However, in practice we 
expect to have few outlying cells compared to the dimension of the table. Finally, 
when the outlyingness is higher (see the fifth scenario), the methods identify more 
outliers as outliers, but also more inliers as outliers. 

Of course, it is worth noting that the experiments performed here are not ex- 
haustive. Several further simulations should be implemented to explore the per- 
formances of the minimal patterns algorithms, and to adjust the simulation param- 
eters. In particular, the behavior of our algorithms for large sparse tables, or for 
tables with zero cell counts, still needs to be explored. 